Speaker-independent 3D face synthesis driven by speech and text
Authors
Abstract
In this study, a complete system that generates visual speech by synthesizing 3D face points has been implemented. The estimated face points drive MPEG-4 facial animation. The system is speaker-independent and can be driven by audio alone or by audio and text together. Visual speech is synthesized with a codebook-based technique trained on audio-visual data from a single speaker. An audio-visual speech data set in the Turkish language was created using a 3D facial motion capture system developed for this study. The performance of the method was evaluated in three categories. First, audio-driven results were reported and compared with the time-delay neural network (TDNN) and recurrent neural network (RNN) algorithms, which are popular in the speech-processing field. It was found that TDNN performs best and RNN performs worst on this data set. Second, results for the codebook-based method after incorporating text information were given; text information together with the audio was seen to improve the synthesis performance significantly. For many applications, the donor speaker of the audio-visual data will not be available to provide audio for synthesis. Therefore, a speaker-independent version of the codebook technique was designed. The speaker-independent results are important because no comparative results have been reported for animating a face model from the speech of speakers other than the training speaker. Although there is a small degradation in trajectory correlation (from 0.71 to 0.67) with respect to speaker-dependent synthesis, the performance remains quite satisfactory. Thus, the resulting system is capable of animating faces realistically from the input speech of any Turkish speaker. © 2006 Published by Elsevier B.V.
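The abstract does not give implementation details, but the general codebook idea it names can be sketched: cluster the training audio feature frames (e.g., MFCCs) with k-means, store for each audio codeword the mean of the visual frames (e.g., MPEG-4 FAP values or 3D point offsets) aligned with it, and at synthesis time replace each incoming audio frame with the visual entry of its nearest codeword. The Python/NumPy sketch below also includes the kind of per-dimension trajectory correlation the evaluation quotes (0.71 vs. 0.67). All function names, feature dimensions, and data are illustrative assumptions, not the authors' actual code.

```python
import numpy as np

def train_codebook(audio, visual, n_codes=64, n_iter=20, seed=0):
    """Cluster training audio frames with k-means and pair every audio
    codeword with the mean of the visual frames assigned to it."""
    rng = np.random.default_rng(seed)
    centers = audio[rng.choice(len(audio), n_codes, replace=False)]
    for _ in range(n_iter):
        # Assign each frame to its nearest codeword, then update centers.
        labels = np.linalg.norm(audio[:, None] - centers[None], axis=2).argmin(1)
        for k in range(n_codes):
            if np.any(labels == k):
                centers[k] = audio[labels == k].mean(axis=0)
    # Visual entry per code: mean of the visual frames mapped to that code.
    visual_cb = np.stack([visual[labels == k].mean(axis=0)
                          if np.any(labels == k) else visual.mean(axis=0)
                          for k in range(n_codes)])
    return centers, visual_cb

def synthesize(audio, centers, visual_cb):
    """Replace each audio frame with the visual entry of its nearest
    codeword, yielding a trajectory of visual parameters over time."""
    labels = np.linalg.norm(audio[:, None] - centers[None], axis=2).argmin(1)
    return visual_cb[labels]

def trajectory_correlation(pred, truth):
    """Mean Pearson correlation across visual dimensions, the style of
    score the abstract reports (0.71 speaker-dependent, 0.67 independent)."""
    corrs = [np.corrcoef(pred[:, i], truth[:, i])[0, 1]
             for i in range(pred.shape[1])]
    return float(np.mean(corrs))

# Toy demo with random stand-ins for MFCC frames and visual parameters;
# with random data the score is near zero, but the pipeline is complete.
rng = np.random.default_rng(0)
audio_train = rng.normal(size=(2000, 13))   # e.g. 13-dim MFCC frames
visual_train = rng.normal(size=(2000, 16))  # e.g. 16 visual parameters
centers, vcb = train_codebook(audio_train, visual_train)
pred = synthesize(audio_train[:300], centers, vcb)
print(trajectory_correlation(pred, visual_train[:300]))
```

In practice the paper's pipeline would add temporal smoothing of the predicted trajectory and, for the text-assisted and speaker-independent variants, extra conditioning that this sketch omits.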
Similar resources
End-to-end Learning for 3D Facial Animation from Raw Waveforms of Speech
We present a deep learning framework for real-time speech-driven 3D facial animation from just raw waveforms. Our deep neural network directly maps an input sequence of speech audio to a series of micro facial action unit activations and head rotations to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual informa...
The UWB 3D talking head text-driven system controlled by the SAT method used for the LIPS 2009 challenge
This paper describes the 3D talking head text-driven system controlled by the SAT (Selection of Articulatory Targets) method developed at the University of West Bohemia (UWB) that will be used for participation in the LIPS 2009 challenge. It gives an overview of methods used for visual speech animation, parameterization of a human face and a tongue, and a synthesis method. A 3D animation model ...
Speech-driven facial animation using a hierarchical model (IEE Proceedings - Vision, Image and Signal Processing)
A system capable of producing near video-realistic animation of a speaker given only speech inputs is presented. The audio input is a continuous speech signal, requires no phonetic labelling and is speaker-independent. The system requires only a short video training corpus of a subject speaking a list of viseme-targeted words in order to achieve convincing realistic facial synthesis. The system...
Recognition Of Voice Using Mel Cepstral Coefficient & Vector Quantization
Human voice is characteristic of an individual. The ability to recognize a speaker by his/her voice can be a valuable biometric tool with enormous commercial as well as academic potential. Commercially, it can be utilized for ensuring secure access to any system. Academically, it can shed light on the speech-processing abilities of the brain as well as the speech mechanism. In fact, this feature...
Text-to-speech synthesis with arbitrary speaker's voice from average voice
This paper describes a technique for synthesizing speech with any desired voice. The technique is based on an HMM-based text-to-speech (TTS) system and the MLLR adaptation algorithm. To generate speech of an arbitrarily given target speaker, speaker-independent speech units, i.e., average voice models, are adapted to the target speaker using the MLLR framework. In addition to spectrum and pitch adaptati...
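The fourth related item above relies on mel cepstral coefficients (MFCCs) and vector quantization for speaker recognition, the same codebook family of techniques the main paper builds on. Below is a minimal, hypothetical Python/NumPy sketch of that MFCC-plus-VQ idea: one k-means codebook per enrolled speaker, with identification by minimum average quantization distortion. All names, dimensions, and data are illustrative stand-ins, not code from any of the cited papers.

```python
import numpy as np

def vq_codebook(frames, n_codes=32, n_iter=15, seed=0):
    """Plain k-means vector quantizer over one speaker's MFCC frames."""
    rng = np.random.default_rng(seed)
    centers = frames[rng.choice(len(frames), n_codes, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(frames[:, None] - centers[None], axis=2).argmin(1)
        for k in range(n_codes):
            if np.any(labels == k):
                centers[k] = frames[labels == k].mean(axis=0)
    return centers

def distortion(frames, centers):
    """Average distance from each frame to its nearest codeword."""
    return np.linalg.norm(frames[:, None] - centers[None], axis=2).min(1).mean()

def identify(test_frames, codebooks):
    """Pick the speaker whose codebook quantizes the test frames best."""
    return min(codebooks, key=lambda spk: distortion(test_frames, codebooks[spk]))

# Toy demo: two 'speakers' as slightly shifted random MFCC-like frames.
rng = np.random.default_rng(1)
train = {"alice": rng.normal(0.0, 1, (500, 13)),
         "bob":   rng.normal(0.8, 1, (500, 13))}
codebooks = {spk: vq_codebook(f) for spk, f in train.items()}
test = rng.normal(0.8, 1, (100, 13))  # drawn from bob's distribution
print(identify(test, codebooks))      # expected: 'bob'
```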
Journal: Signal Processing
Volume 86, Issue -
Pages: -
Publication date: 2006